V-measure score (v_measure_score)#
V-measure is an external clustering metric: it evaluates a predicted clustering labels_pred against known ground-truth labels labels_true.
It combines two complementary requirements:
Homogeneity: each predicted cluster contains only members of a single class (pure clusters)
Completeness: all members of a given class are assigned to the same cluster (do not split classes)
V-measure is the (weighted) harmonic mean of homogeneity and completeness, so it is high only when both are high. For example, homogeneity 0.9 and completeness 0.5 give V = 2·0.9·0.5/(0.9+0.5) ≈ 0.64: the harmonic mean is pulled toward the weaker score.
Learning goals#
Build intuition for homogeneity vs completeness
Derive V-measure from entropy / mutual information
Implement v_measure_score from scratch in NumPy
Use V-measure to tune a simple clustering algorithm (K-means) when labels are available
Know pros/cons, pitfalls, and when to use it
Quick import#
from sklearn.metrics import v_measure_score
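A minimal usage sketch (toy arrays, made up for illustration), using the import above: the score depends only on the grouping, not on the cluster IDs themselves.
labels_true = [0, 0, 1, 1]
labels_pred = [1, 1, 0, 0]  # same grouping as labels_true, different IDs
v_measure_score(labels_true, labels_pred)  # 1.0: a perfect match up to renaming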
When should you use it?#
Use V-measure when you have ground-truth categories (or a labeled validation set) and want to evaluate or compare clustering results.
If you do not have labels, prefer internal metrics (e.g., silhouette) or task-specific evaluation.
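For the unlabeled case, a minimal sketch with an internal metric might look like this (the synthetic data and parameter choices are illustrative, not a recommendation):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two synthetic 2D blobs, for illustration only
X = np.vstack([rng.normal(-2, 0.5, size=(50, 2)), rng.normal(2, 0.5, size=(50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
silhouette_score(X, labels)  # internal metric: no ground-truth labels needed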
Intuition: merge vs split#
Think of the true labels as colors (classes) and the clustering output as groups (clusters).
If a cluster mixes many colors → it is not homogeneous.
If a color is scattered across many clusters → it is not complete.
V-measure forces a balance.
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.metrics import (
completeness_score,
homogeneity_score,
normalized_mutual_info_score,
v_measure_score,
)
pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
Warm-up: four clusterings of the same labeled dataset#
Below we create four different labels_pred arrays for the same labels_true:
Perfect (up to permutation of cluster IDs)
Under-clustering: everything in one cluster (complete but not homogeneous)
Over-clustering: split each class into multiple clusters (homogeneous but not complete)
Random assignment
n_per_class = 60
labels_true = np.repeat(np.arange(3), n_per_class)
# 1) perfect, but with a permutation of cluster IDs
perm_map = {0: 2, 1: 0, 2: 1}
labels_pred_perfect = np.vectorize(perm_map.get)(labels_true)
# 2) under-clustering: everything in one cluster
labels_pred_one_cluster = np.zeros_like(labels_true)
# 3) over-clustering: split each class into 2 clusters
# class 0 -> clusters 0/1, class 1 -> clusters 2/3, class 2 -> clusters 4/5
split_bit = np.arange(labels_true.size) % 2
labels_pred_split_each_class = labels_true * 2 + split_bit
# 4) random assignment into 3 clusters
labels_pred_random = rng.integers(0, 3, size=labels_true.size)
cases = {
'perfect (permute ids)': labels_pred_perfect,
'one cluster (merge classes)': labels_pred_one_cluster,
'split each class (over-cluster)': labels_pred_split_each_class,
'random (3 clusters)': labels_pred_random,
}
rows = []
for name, labels_pred in cases.items():
h = homogeneity_score(labels_true, labels_pred)
c = completeness_score(labels_true, labels_pred)
v = v_measure_score(labels_true, labels_pred)
rows.extend(
[
{'case': name, 'metric': 'homogeneity', 'value': h},
{'case': name, 'metric': 'completeness', 'value': c},
{'case': name, 'metric': 'v_measure', 'value': v},
]
)
fig = px.bar(
rows,
x='case',
y='value',
color='metric',
barmode='group',
title='Homogeneity vs completeness vs V-measure on simple labelings',
)
fig.update_layout(yaxis=dict(range=[0, 1.05]))
fig.show()
How to read the plot#
One cluster: completeness is 1 (each class is fully contained in one cluster), but homogeneity is 0 (the single cluster mixes all classes, so it tells us nothing about them).
Split each class: homogeneity is 1 (each cluster is pure), but completeness is low (each class is spread across clusters).
V-measure penalizes both extremes.
Definition (information-theoretic)#
Let:
\(C\) be the random variable for the true class label
\(K\) be the random variable for the predicted cluster label
\(n_{ck}\) be the contingency table counts (how many samples of class \(c\) are assigned to cluster \(k\))
\(N = \sum_{c,k} n_{ck}\)
From the contingency table we get marginals:
\(n_c = \sum_k n_{ck}\)
\(n_k = \sum_c n_{ck}\)
\(p(c) = n_c / N\), \(p(k) = n_k / N\), \(p(c,k) = n_{ck} / N\)
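To make the notation concrete, here is a tiny made-up contingency table and its marginals:
import numpy as np

n_ck = np.array([[50, 10],
                 [5, 35]])   # rows: true classes c, columns: clusters k
N = n_ck.sum()               # 100
n_c = n_ck.sum(axis=1)       # class sizes [60, 40]
n_k = n_ck.sum(axis=0)       # cluster sizes [55, 45]
p_ck, p_c, p_k = n_ck / N, n_c / N, n_k / N  # joint and marginal distributions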
Entropy and mutual information#
\[ H(C) = -\sum_c p(c)\,\log p(c), \qquad H(K) = -\sum_k p(k)\,\log p(k) \]
\[ I(C;K) = \sum_{c,k} p(c,k)\,\log \frac{p(c,k)}{p(c)\,p(k)} \]
(Any log base works; V-measure is a ratio, so the base cancels.)
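A quick numeric check of the base-independence claim, on a small made-up contingency table (v_from_cont is a throwaway helper, not a library function):
import numpy as np

def v_from_cont(cont, log):
    # V-measure (beta=1) from a contingency table, using the given log function
    n = cont.sum()
    p_c = cont.sum(axis=1) / n
    p_k = cont.sum(axis=0) / n
    h_c = -(p_c * log(p_c)).sum()
    h_k = -(p_k * log(p_k)).sum()
    mi = (cont / n * log((cont / n) / np.outer(p_c, p_k))).sum()
    h, c = mi / h_c, mi / h_k
    return 2 * h * c / (h + c)

cont = np.array([[30.0, 10.0], [5.0, 25.0]])  # all entries > 0, so log is safe
v_from_cont(cont, np.log), v_from_cont(cont, np.log2)  # identical values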
Homogeneity and completeness#
Homogeneity:
\[ h = \frac{I(C;K)}{H(C)} = 1 - \frac{H(C \mid K)}{H(C)} \]
Completeness:
\[ c = \frac{I(C;K)}{H(K)} = 1 - \frac{H(K \mid C)}{H(K)} \]
Edge cases:
If \(H(C)=0\) (only one true class), define \(h=1\).
If \(H(K)=0\) (only one predicted cluster), define \(c=1\).
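These conventions match scikit-learn's behavior, as a quick check with toy arrays shows:
from sklearn.metrics import completeness_score, homogeneity_score

print(homogeneity_score([0, 0, 0], [0, 1, 2]))   # 1.0: only one true class, so H(C) = 0
print(completeness_score([0, 1, 2], [0, 0, 0]))  # 1.0: only one predicted cluster, so H(K) = 0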
V-measure#
With a weight \(\beta \ge 0\) (bigger \(\beta\) emphasizes completeness):
\[ V_\beta = \frac{(1 + \beta)\, h\, c}{\beta\, h + c} \]
For \(\beta=1\) it is symmetric and equivalent to normalized mutual information with arithmetic normalization:
\[ V_1 = \frac{2\, h\, c}{h + c} = \frac{2\, I(C;K)}{H(C) + H(K)} \]
NumPy implementation (from scratch)#
We will implement everything from the contingency table up.
Notes:
We use natural logarithms (np.log), matching scikit-learn.
The score is invariant to label permutations: only the contingency table matters.
def _encode_labels(labels):
labels = np.asarray(labels)
if labels.ndim != 1:
raise ValueError('labels must be 1D')
uniques, inv = np.unique(labels, return_inverse=True)
return uniques, inv
def contingency_matrix_np(labels_true, labels_pred):
"""Contingency matrix n_{ck} with shape (n_classes, n_clusters)."""
labels_true = np.asarray(labels_true)
labels_pred = np.asarray(labels_pred)
if labels_true.shape != labels_pred.shape:
raise ValueError('labels_true and labels_pred must have the same shape')
if labels_true.size == 0:
raise ValueError('empty label arrays')
classes, class_idx = _encode_labels(labels_true)
clusters, cluster_idx = _encode_labels(labels_pred)
cont = np.zeros((classes.size, clusters.size), dtype=np.int64)
np.add.at(cont, (class_idx, cluster_idx), 1)
return cont, classes, clusters
def entropy_from_counts(counts):
"""Shannon entropy of a discrete distribution given counts (natural log)."""
counts = np.asarray(counts, dtype=float)
total = counts.sum()
if total <= 0:
return 0.0
p = counts / total
p = p[p > 0]
return float(-np.sum(p * np.log(p)))
def mutual_info_from_contingency(cont):
"""Mutual information I(C;K) from contingency matrix (natural log)."""
cont = np.asarray(cont, dtype=float)
n = cont.sum()
if n <= 0:
return 0.0
    pi = cont.sum(axis=1)  # class marginal counts
    pj = cont.sum(axis=0)  # cluster marginal counts
i_idx, j_idx = np.nonzero(cont)
n_ij = cont[i_idx, j_idx]
return float(np.sum((n_ij / n) * np.log((n * n_ij) / (pi[i_idx] * pj[j_idx]))))
def homogeneity_completeness_v_measure_np(labels_true, labels_pred, beta=1.0):
if beta < 0:
raise ValueError('beta must be >= 0')
cont, _, _ = contingency_matrix_np(labels_true, labels_pred)
h_c = entropy_from_counts(cont.sum(axis=1))
h_k = entropy_from_counts(cont.sum(axis=0))
mi = mutual_info_from_contingency(cont)
homogeneity = 1.0 if h_c == 0.0 else mi / h_c
completeness = 1.0 if h_k == 0.0 else mi / h_k
homogeneity = float(np.clip(homogeneity, 0.0, 1.0))
completeness = float(np.clip(completeness, 0.0, 1.0))
if homogeneity == 0.0 or completeness == 0.0:
v = 0.0
else:
v = (1.0 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)
return homogeneity, completeness, float(v)
def v_measure_score_np(labels_true, labels_pred, beta=1.0):
return homogeneity_completeness_v_measure_np(labels_true, labels_pred, beta=beta)[2]
# Sanity check: compare to scikit-learn on random labelings
def _check_against_sklearn(n_trials=200, n=200, n_classes=5, n_clusters=7):
for _ in range(n_trials):
y_true = rng.integers(0, n_classes, size=n)
y_pred = rng.integers(0, n_clusters, size=n)
h_np, c_np, v_np = homogeneity_completeness_v_measure_np(y_true, y_pred, beta=1.0)
h_sk = homogeneity_score(y_true, y_pred)
c_sk = completeness_score(y_true, y_pred)
v_sk = v_measure_score(y_true, y_pred)
if not (
np.isclose(h_np, h_sk, atol=1e-12, rtol=0)
and np.isclose(c_np, c_sk, atol=1e-12, rtol=0)
and np.isclose(v_np, v_sk, atol=1e-12, rtol=0)
):
return False, (h_np, h_sk, c_np, c_sk, v_np, v_sk)
return True, None
ok, debug = _check_against_sklearn()
ok
True
# V-measure (beta=1) equals normalized mutual information with arithmetic normalization
for name, labels_pred in cases.items():
v = v_measure_score(labels_true, labels_pred)
nmi = normalized_mutual_info_score(labels_true, labels_pred, average_method='arithmetic')
print(f'{name:30s} v={v:.6f} nmi(arithmetic)={nmi:.6f}')
perfect (permute ids) v=1.000000 nmi(arithmetic)=1.000000
one cluster (merge classes) v=0.000000 nmi(arithmetic)=0.000000
split each class (over-cluster) v=0.760188 nmi(arithmetic)=0.760188
random (3 clusters) v=0.005959 nmi(arithmetic)=0.005959
Visualizing what the metric sees: contingency tables#
V-measure only depends on how true classes and predicted clusters overlap.
A perfect clustering produces a contingency table with exactly one non-zero entry in each row and each column (a scaled permutation matrix).
Under-clustering (merging classes) creates columns with many non-zero rows.
Over-clustering (splitting classes) creates rows with many non-zero columns.
fig = make_subplots(
rows=2,
cols=2,
subplot_titles=list(cases.keys()),
horizontal_spacing=0.08,
vertical_spacing=0.14,
)
for i, (name, labels_pred) in enumerate(cases.items()):
cont, classes, clusters = contingency_matrix_np(labels_true, labels_pred)
r, c = i // 2 + 1, i % 2 + 1
fig.add_trace(
go.Heatmap(
z=cont,
x=[str(k) for k in clusters],
y=[str(c_) for c_ in classes],
colorscale='Blues',
showscale=i == 0,
hovertemplate='true=%{y}<br>cluster=%{x}<br>count=%{z}<extra></extra>',
),
row=r,
col=c,
)
fig.update_layout(
title='Contingency tables: true class (rows) × predicted cluster (cols)',
height=650,
)
fig.update_xaxes(title_text='predicted cluster')
fig.update_yaxes(title_text='true class')
fig.show()
The \(\beta\) parameter: choose which mistake hurts more#
Larger \(\beta\) emphasizes completeness (do not split classes).
Smaller \(\beta\) emphasizes homogeneity (do not mix classes).
Below, compare how \(V_{\beta}\) changes for a merge-style mistake vs a split-style mistake.
betas = np.logspace(-2, 2, 250)
beta_rows = []
for beta in betas:
for name, labels_pred in {
'merge classes (one cluster)': labels_pred_one_cluster,
'split classes (over-cluster)': labels_pred_split_each_class,
}.items():
v = v_measure_score_np(labels_true, labels_pred, beta=float(beta))
beta_rows.append({'beta': beta, 'case': name, 'v_measure': v})
fig = px.line(
beta_rows,
x='beta',
y='v_measure',
color='case',
log_x=True,
title='Effect of beta on V-measure',
)
fig.add_vline(x=1.0, line_dash='dash', line_color='gray')
fig.update_layout(yaxis=dict(range=[0, 1.05]))
fig.show()
Using V-measure to optimize a simple algorithm (K-means)#
V-measure is not differentiable w.r.t. cluster assignments, so you typically do not optimize it with gradient descent.
Instead, you use it for model selection when labels are available, e.g.:
pick the number of clusters \(k\)
pick the best initialization / run among many
tune algorithm hyperparameters
Below we:
Generate a labeled 2D dataset (three Gaussian blobs)
Run a low-level NumPy K-means implementation for different \(k\)
Choose the \(k\) that maximizes V-measure
centers = np.array([[-2.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
cluster_std = 0.6
n_per_center = 200
X_parts = []
y_parts = []
for i, mu in enumerate(centers):
X_i = rng.normal(loc=mu, scale=cluster_std, size=(n_per_center, 2))
y_i = np.full(n_per_center, i)
X_parts.append(X_i)
y_parts.append(y_i)
X = np.vstack(X_parts)
y_true = np.concatenate(y_parts)
perm = rng.permutation(X.shape[0])
X = X[perm]
y_true = y_true[perm]
fig = px.scatter(
x=X[:, 0],
y=X[:, 1],
color=y_true.astype(str),
title='Synthetic dataset (colored by true class)',
labels={'x': 'x1', 'y': 'x2', 'color': 'true class'},
)
fig.show()
def kmeans_np(X, k, *, n_init=10, max_iter=100, rng=None):
"""A small NumPy K-means implementation (Lloyd's algorithm).
Returns: (labels, centroids, inertia)
"""
X = np.asarray(X, dtype=float)
if X.ndim != 2:
raise ValueError('X must be 2D')
if k <= 0:
raise ValueError('k must be >= 1')
n_samples = X.shape[0]
rng = np.random.default_rng(rng)
best_inertia = np.inf
best_labels = None
best_centroids = None
for _ in range(n_init):
init_idx = rng.choice(n_samples, size=k, replace=n_samples < k)
centroids = X[init_idx].copy()
labels = None
for _ in range(max_iter):
d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
new_labels = d2.argmin(axis=1)
if labels is not None and np.array_equal(new_labels, labels):
break
labels = new_labels
new_centroids = centroids.copy()
for j in range(k):
mask = labels == j
if not np.any(mask):
new_centroids[j] = X[rng.integers(0, n_samples)]
else:
new_centroids[j] = X[mask].mean(axis=0)
centroids = new_centroids
inertia = float(((X - centroids[labels]) ** 2).sum())
if inertia < best_inertia:
best_inertia = inertia
best_labels = labels.copy()
best_centroids = centroids.copy()
return best_labels, best_centroids, best_inertia
results = []
for k in range(2, 9):
labels_pred, centroids_k, inertia = kmeans_np(X, k, n_init=20, max_iter=200, rng=123)
h, c, v = homogeneity_completeness_v_measure_np(y_true, labels_pred, beta=1.0)
results.append(
{
'k': k,
'homogeneity': h,
'completeness': c,
'v_measure': v,
'inertia': inertia,
'labels_pred': labels_pred,
'centroids': centroids_k,
}
)
best = max(results, key=lambda r: r['v_measure'])
best['k'], best['v_measure']
(3, 1.0)
long = []
for r in results:
for m in ['homogeneity', 'completeness', 'v_measure']:
long.append({'k': r['k'], 'metric': m, 'value': r[m]})
fig1 = px.line(
long,
x='k',
y='value',
color='metric',
markers=True,
title='External model selection with labels: maximize V-measure',
)
fig1.update_layout(yaxis=dict(range=[0, 1.05]))
results_summary = [{'k': r['k'], 'inertia': r['inertia']} for r in results]
fig2 = px.line(
results_summary,
x='k',
y='inertia',
markers=True,
    title='Inertia (K-means objective) keeps decreasing as k grows',
)
fig = make_subplots(rows=1, cols=2, subplot_titles=[fig1.layout.title.text, fig2.layout.title.text])
for tr in fig1.data:
fig.add_trace(tr, row=1, col=1)
for tr in fig2.data:
fig.add_trace(tr, row=1, col=2)
fig.update_layout(height=420, showlegend=True)
fig.update_yaxes(range=[0, 1.05], row=1, col=1)
fig.add_vline(x=best['k'], line_dash='dash', line_color='gray', row=1, col=1)
fig.show()
labels_best = best['labels_pred']
centroids_best = best['centroids']
fig = px.scatter(
x=X[:, 0],
y=X[:, 1],
color=labels_best.astype(str),
title=f"K-means clustering for k={best['k']} (colored by predicted cluster)",
labels={'x': 'x1', 'y': 'x2', 'color': 'cluster'},
)
fig.add_trace(
go.Scatter(
x=centroids_best[:, 0],
y=centroids_best[:, 1],
mode='markers',
marker=dict(color='black', size=12, symbol='x'),
name='centroids',
)
)
fig.show()
Pros, cons, and where it is useful#
Pros#
Permutation-invariant (cluster IDs do not matter)
Bounded in \([0,1]\) and easy to compare across runs
Works when the number of clusters differs from the number of classes
Decomposes into two interpretable parts (homogeneity vs completeness)
With \(\beta=1\) it equals NMI with arithmetic normalization
Cons / pitfalls#
Requires ground-truth labels (external metric)
Can be pushed toward extremes (see the sketch after this list):
many tiny clusters → high homogeneity
one giant cluster → high completeness
Ignores geometry/distances: it only evaluates the final label assignments
Not differentiable: typically used for evaluation or hyperparameter search, not gradient-based training
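A minimal sketch of both degenerate extremes (toy labels):
import numpy as np
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

labels_true = np.repeat(np.arange(3), 20)
singletons = np.arange(labels_true.size)  # every point its own cluster
one_cluster = np.zeros_like(labels_true)  # one giant cluster

print(homogeneity_score(labels_true, singletons))    # 1.0: every cluster is pure
print(completeness_score(labels_true, singletons))   # well below 1: every class is shattered
print(homogeneity_score(labels_true, one_cluster))   # 0.0: the one cluster mixes everything
print(completeness_score(labels_true, one_cluster))  # 1.0: no class is split
print(v_measure_score(labels_true, singletons),      # both V-measures stay well below 1
      v_measure_score(labels_true, one_cluster))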
Good use cases#
Benchmarking clustering algorithms on labeled datasets
Hyperparameter selection when you have a labeled validation set (semi-supervised model selection)
Comparing runs/initializations of a clustering method when labels are available
Common diagnostics and pitfalls#
Always inspect the contingency matrix: it explains why V-measure is high or low (see the snippet after this list).
Check homogeneity and completeness separately before trusting the combined score.
Choose \(\beta\) based on what errors matter:
if splitting a class is very bad → use larger \(\beta\)
if mixing classes is very bad → use smaller \(\beta\)
If you do not have labels, V-measure cannot be computed; use internal metrics.
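scikit-learn also exposes the contingency matrix directly, which is handy for this kind of inspection (the label arrays below are illustrative):
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

labels_true = np.repeat(np.arange(3), 4)
labels_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 2])  # cluster 0 mixes classes 0 and 1

contingency_matrix(labels_true, labels_pred)  # rows: true classes, columns: clusters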
Exercises#
Increase the number of clusters in the random case and see how homogeneity changes.
Create an imbalanced dataset (one class much larger) and see how the score behaves.
Compare V-measure to adjusted mutual information (AMI) and adjusted Rand index (ARI).
Change \(\beta\) and pick the \(k\) that maximizes \(V_{\beta}\) in the K-means section.
References#
Rosenberg & Hirschberg (2007): V-measure: A conditional entropy-based external cluster evaluation measure
scikit-learn docs: https://scikit-learn.org/stable/modules/clustering.html#homogeneity-completeness-and-v-measure
API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.v_measure_score.html